Empirical distribution function

In statistics, the empirical distribution function, or empirical cdf, is the cumulative distribution function associated with the empirical measure of the sample. This cdf is a step function that jumps up by 1/n at each of the n data points. The empirical distribution function estimates the true underlying cdf of the points in the sample. A number of results quantify the rate at which the empirical cdf converges to the underlying cdf.


Definition

Let (x₁, …, x_n) be independent and identically distributed (iid) real random variables with common cdf F(t). Then the empirical distribution function is defined as ^[1]

$\hat F_n(t) = \frac{ \mbox{number of elements in the sample} \leq t}n = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{x_i \le t\},$

where 1{A} is the indicator of event A. For a fixed t, the indicator 1{x_i ≤ t} is a Bernoulli random variable with parameter p = F(t), hence $\scriptstyle n \hat F_n(t)$ is a binomial random variable with mean nF(t) and variance nF(t)(1 − F(t)). This implies that $\scriptstyle \hat F_n(t)$ is an unbiased estimator for F(t).
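The definition translates almost directly into code. The following is a minimal sketch in Python, using only the standard library; the function name `ecdf` is illustrative and not taken from any particular package.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical cdf of `sample` as a callable t -> F_n(t)."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must be non-empty")

    def F_n(t):
        # In a sorted list, bisect_right gives the number of elements <= t,
        # which is the sum of the indicators 1{x_i <= t}.
        return bisect_right(xs, t) / n

    return F_n


F = ecdf([3.0, 1.0, 2.0, 2.0])
print(F(0.5), F(1.0), F(2.0), F(2.5), F(3.0))  # 0.0 0.25 0.75 0.75 1.0
```

In this example the value 2.0 occurs twice, so the step at t = 2 has height 2/n rather than 1/n.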

Asymptotic properties

By the strong law of large numbers, the estimator $\scriptstyle\hat{F}_n(t)$ converges to F(t) as n → ∞ almost surely, for every value of t: ^[2]

$\hat F_n(t)\ \xrightarrow{a.s.}\ F(t),$

thus the estimator $\scriptstyle\hat{F}_n(t)$ is consistent. This expression asserts the pointwise convergence of the empirical distribution function to the true cdf. There is a stronger result, called the Glivenko–Cantelli theorem, which states that the convergence in fact happens uniformly over t: ^[3]

$\|\hat F_n-F\|_\infty \equiv \sup_{t\in\mathbb{R}} \big|\hat F_n(t)-F(t)\big|\ \xrightarrow{a.s.}\ 0.$

The sup-norm in this expression is called the Kolmogorov–Smirnov statistic for testing the goodness-of-fit between the empirical distribution $\scriptstyle\hat{F}_n(t)$ and the assumed true cdf F. Other norm functions may be reasonably used here instead of the sup-norm. For example, the L²-norm gives rise to the Cramér–von Mises statistic.
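Because the empirical cdf is a step function, the supremum can be evaluated at the data points alone when the hypothesized cdf F is continuous. The sketch below assumes that continuity; the helper name `ks_statistic` and the uniform example are illustrative choices.

```python
def ks_statistic(sample, cdf):
    """Sup-norm distance between the empirical cdf of `sample` and a continuous `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must be non-empty")
    d = 0.0
    for i, x in enumerate(xs, start=1):
        fx = cdf(x)
        # Just after x the empirical cdf is at least i/n; just before x it is
        # at most (i - 1)/n. The largest gap occurs at one of these two sides.
        d = max(d, i / n - fx, fx - (i - 1) / n)
    return d


def uniform_cdf(t):
    """cdf of the uniform distribution on [0, 1]."""
    return min(max(t, 0.0), 1.0)


print(ks_statistic([0.1, 0.4, 0.7], uniform_cdf))  # approximately 0.3
```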

The asymptotic distribution can be further characterized in several different ways. First, the central limit theorem states that, pointwise, $\scriptstyle\hat{F}_n(t)$ is asymptotically normal with the standard √n rate of convergence: ^[2]

$\sqrt{n}\big(\hat F_n(t) - F(t)\big)\ \ \xrightarrow{d}\ \ \mathcal{N}\Big( 0, F(t)\big(1-F(t)\big) \Big).$
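A small simulation can make the pointwise limit concrete. The sketch below is only an illustration under stated assumptions: the samples are Uniform(0, 1), so that F(t) = t, and the sample size, number of replications, and seed are arbitrary choices.

```python
import math
import random


def simulate_pointwise_clt(n=200, t=0.3, reps=2000, seed=0):
    """Simulate sqrt(n) * (F_n(t) - F(t)) for Uniform(0, 1) samples of size n."""
    rng = random.Random(seed)
    p = t  # F(t) = t for the uniform distribution on [0, 1], with 0 <= t <= 1
    draws = []
    for _ in range(reps):
        count = sum(1 for _ in range(n) if rng.random() <= t)
        draws.append(math.sqrt(n) * (count / n - p))
    mean = sum(draws) / reps
    var = sum((d - mean) ** 2 for d in draws) / (reps - 1)
    # Share of draws within 1.96 limiting standard deviations of zero.
    sd = math.sqrt(p * (1.0 - p))
    coverage = sum(1 for d in draws if abs(d) <= 1.96 * sd) / reps
    return var, coverage


var, coverage = simulate_pointwise_clt()
print(var, coverage)
```

The sample variance should be close to F(t)(1 − F(t)) = 0.21, and the share of draws within ±1.96 limiting standard deviations should be close to 0.95, as the normal limit suggests.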

This result is extended by Donsker’s theorem, which asserts that the empirical process $\scriptstyle\sqrt{n}(\hat{F}_n - F)$, viewed as a function indexed by t ∈ R, converges in distribution in the Skorokhod space D[−∞, +∞] to the mean-zero Gaussian process G_F = B∘F, where B is the standard Brownian bridge.^[3] The covariance structure of this Gaussian process is

$\mathrm{E}[\,G_F(t_1)G_F(t_2)\,] = F(t_1\wedge t_2) - F(t_1)F(t_2).$
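This covariance can be checked directly at finite n. The derivation below is a short sketch that uses only the iid assumption and the identity 1{x ≤ t₁}·1{x ≤ t₂} = 1{x ≤ t₁ ∧ t₂}. For a single observation,

$\mathrm{Cov}\big(\mathbf{1}\{x_i \le t_1\},\, \mathbf{1}\{x_i \le t_2\}\big) = \mathrm{E}\big[\mathbf{1}\{x_i \le t_1\wedge t_2\}\big] - F(t_1)F(t_2) = F(t_1\wedge t_2) - F(t_1)F(t_2),$

and since the cross terms with i ≠ j vanish by independence,

$\mathrm{Cov}\Big(\sqrt{n}\big(\hat F_n(t_1)-F(t_1)\big),\ \sqrt{n}\big(\hat F_n(t_2)-F(t_2)\big)\Big) = \frac{1}{n}\sum_{i=1}^n \big(F(t_1\wedge t_2) - F(t_1)F(t_2)\big) = F(t_1\wedge t_2) - F(t_1)F(t_2).$

The empirical process therefore has the same covariance as G_F for every n, and setting t₁ = t₂ = t recovers the pointwise variance F(t)(1 − F(t)).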

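The uniform rate of convergence in Donsker’s theorem can be quantified by a result known as the Hungarian embedding: on a suitable probability space there exist Gaussian processes G_{F,n}, each distributed as G_F, such that ^[4]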

$\limsup_{n\to\infty} \frac{\sqrt{n}}{\ln^2 n} \big\| \sqrt{n}(\hat F_n-F) - G_{F,n}\big\|_\infty < \infty, \quad \text{a.s.}$

Alternatively, the rate of convergence of $\scriptstyle\sqrt{n}(\hat{F}_n-F)$ can also be quantified in terms of the asymptotic behavior of the sup-norm of this expression. A number of results exist in this vein; for example, the Dvoretzky–Kiefer–Wolfowitz inequality provides a bound on the tail probabilities of $\scriptstyle\sqrt{n}\|\hat{F}_n-F\|_\infty$: ^[4]

$\Pr\!\Big( \sqrt{n}\|\hat{F}_n-F\|_\infty > z \Big) \leq 2e^{-2z^2}.$
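Setting the right-hand side equal to a level α and solving for z gives z = √(ln(2/α)/2), so with probability at least 1 − α the empirical cdf stays within ε = √(ln(2/α)/(2n)) of F uniformly in t. A minimal sketch of the resulting confidence band follows; the function names `dkw_epsilon` and `dkw_band` are illustrative.

```python
import math


def dkw_epsilon(n, alpha=0.05):
    """Half-width eps such that P(sup_t |F_n(t) - F(t)| > eps) <= alpha."""
    if n <= 0 or not 0.0 < alpha < 1.0:
        raise ValueError("need n > 0 and 0 < alpha < 1")
    # Solve 2 * exp(-2 * z**2) = alpha for z, then divide by sqrt(n).
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))


def dkw_band(F_n_value, n, alpha=0.05):
    """Band [lower, upper] around one value of the empirical cdf, clipped to [0, 1]."""
    eps = dkw_epsilon(n, alpha)
    return max(F_n_value - eps, 0.0), min(F_n_value + eps, 1.0)


print(dkw_epsilon(100))    # about 0.136
print(dkw_band(0.5, 100))  # about (0.364, 0.636)
```

The half-width shrinks at the rate 1/√n, so quadrupling the sample size halves the band.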

In fact, Kolmogorov has shown that if the cdf F is continuous, then the expression $\scriptstyle\sqrt{n}\|\hat{F}_n-F\|_\infty$ converges in distribution to ||B||_∞, the sup-norm of the Brownian bridge. Its law is the Kolmogorov distribution, which does not depend on the form of F.
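The cdf of the Kolmogorov distribution has the well-known series representation K(x) = 1 − 2 Σ_{k≥1} (−1)^{k−1} e^{−2k²x²} for x > 0. The sketch below evaluates it by truncating the series; the function name, the tolerance, and the term limit are illustrative choices, and the alternating sum loses accuracy for very small x.

```python
import math


def kolmogorov_cdf(x, tol=1e-12, max_terms=10_000):
    """Approximate P(||B||_inf <= x) by truncating the alternating series."""
    if x <= 0.0:
        return 0.0
    total = 0.0
    for k in range(1, max_terms + 1):
        term = math.exp(-2.0 * k * k * x * x)
        # Signs alternate: +, -, +, ... starting with k = 1.
        total += term if k % 2 == 1 else -term
        if term < tol:
            break
    # Clip to [0, 1] to absorb rounding error from cancellation.
    return min(max(1.0 - 2.0 * total, 0.0), 1.0)


print(kolmogorov_cdf(1.358))  # close to 0.95
```

The first term of the series, 2e^{−2x²}, is the same expression as the Dvoretzky–Kiefer–Wolfowitz bound, which is why that bound is nearly sharp for large arguments.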

Another result, which follows from the law of the iterated logarithm, is that ^[4]

$\limsup_{n\to\infty} \frac{\sqrt{n}\|\hat{F}_n-F\|_\infty}{\sqrt{2\ln\ln n}} \leq \frac12, \quad \text{a.s.}$

and

$\liminf_{n\to\infty} \sqrt{2n\ln\ln n} \|\hat{F}_n-F\|_\infty = \frac{\pi}{2}, \quad \text{a.s.}$

References

Shorack, G.R.; Wellner, J.A. (1986). Empirical processes with applications to statistics. New York: Wiley.
van der Vaart, A.W. (1998). Asymptotic statistics. Cambridge University Press. ISBN 978-0-521-78450-4.

Notes

^[1] van der Vaart (1998, page 265); PlanetMath
^[2] van der Vaart (1998, page 265)
^[3] van der Vaart (1998, page 266)
^[4] van der Vaart (1998, page 268)
